Automated Annotation Workflow

This workflow uses the auto_annot tools from besca to newly annotate a scRNAseq dataset based on one or more preannotated datasets. Ideally, these datasets come from a similar tissue and condition.

Each individual cell is annotated with a supervised machine learning method such as a support vector machine (SVM) or logistic regression.

First, the training dataset(s) and the testing dataset are loaded from h5ad files or made available as adata objects. Next, the training and testing datasets are corrected using scanorama, and the training datasets are then merged into one anndata object. Then, the classifier is trained on the merged training data. Finally, the classifier is applied to the testing dataset to predict the cell types. If the testing dataset is already annotated (to test the algorithm), a report including confusion matrices can be generated.
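The train-then-predict core of this workflow can be illustrated with scikit-learn on synthetic data. This is only a minimal sketch: the matrices, the labels `type_A` and `type_B`, and the direct use of `StandardScaler` and `LogisticRegression` are assumptions made for illustration and are not taken from the besca implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic "expression" matrices (cells x genes) for two made-up cell types.
n_genes = 20
X_train = np.vstack([rng.normal(0.0, 1.0, (50, n_genes)),
                     rng.normal(3.0, 1.0, (50, n_genes))])
y_train = np.array(["type_A"] * 50 + ["type_B"] * 50)
X_test = np.vstack([rng.normal(0.0, 1.0, (10, n_genes)),
                    rng.normal(3.0, 1.0, (10, n_genes))])

# Fit the scaler on the training data only, then reuse it for the test data.
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# Predict one label per test cell.
y_pred = clf.predict(scaler.transform(X_test))
print(y_pred)
```

In the actual workflow the dataset correction and merging step sits between loading and training; it is left out here to keep the sketch short.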

In [1]:
import besca as bc
.local/lib/python3.7/site-packages/sklearn/externals/six.py:31: FutureWarning: The module is deprecated in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the official version of six (https://pypi.org/project/six/).
  "(https://pypi.org/project/six/).", FutureWarning)
In [2]:
import scanpy as sc
import pkg_resources
import os

Test loading datasets with scvelo

The scvelo (scv) loader appears to bring all adata objects into a comparable format, whereas the scanpy (sc) loader loads them as is.

In [3]:
adata_test = bc.datasets.pbmc3k_processed()
In [4]:
adata_test_orig  = bc.datasets.pbmc3k_processed()
In [5]:
adata_train1 = bc.datasets.Granja2019_processed()
In [6]:
adata_train2 = bc.datasets.Kotliarov2020_processed()

Concatenation does not lead to errors when the scv loader is used.

In [7]:
train001 = adata_train1.concatenate(adata_train2)
In [8]:
adata_train_list = [adata_train1, adata_train2]

Parameter specification

Give your analysis a name.

In [9]:
analysis_name = 'auto_annot_pubimage' # The analysis name will be used to name the output files

Specify the column name of the cell type annotation you want to train on.

In [10]:
celltype ='dblabel' # This needs to be a column in the .obs of the training datasets (and test dataset if you want to generate a report)

Choose a method:

  • linear: Support Vector Machine with Linear Kernel
  • sgd: Support Vector Machine with Linear Kernel using Stochastic Gradient Descent
  • rbf: Support Vector Machine with radial basis function kernel. Very time intensive, use only on small datasets.
  • logistic_regression: Standard logistic classifier with multinomial loss.
  • logistic_regression_ovr: Logistic Regression with one versus rest classification.
  • logistic_regression_elastic: Logistic Regression with elastic loss, cross validates among multiple l1 ratios.
In [11]:
method = 'logistic_regression'
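To make the method names more concrete, the sketch below maps some of them to scikit-learn estimators that match the descriptions in the list. The helper `make_classifier` is hypothetical, and the mapping is an assumption based only on those descriptions; besca may configure its estimators differently. The elastic net variant is left out because its settings are not described here.

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC


def make_classifier(method):
    """Hypothetical helper: return a scikit-learn estimator for a method name."""
    factories = {
        # SVM with linear kernel
        "linear": lambda: LinearSVC(),
        # Linear SVM trained with stochastic gradient descent (hinge loss)
        "sgd": lambda: SGDClassifier(loss="hinge"),
        # SVM with radial basis function kernel (slow on large datasets)
        "rbf": lambda: SVC(kernel="rbf"),
        # Logistic regression on all classes at once
        "logistic_regression": lambda: LogisticRegression(max_iter=1000),
        # One-versus-rest logistic regression
        "logistic_regression_ovr": lambda: OneVsRestClassifier(
            LogisticRegression(max_iter=1000)),
    }
    if method not in factories:
        raise ValueError(f"unknown method: {method!r}")
    return factories[method]()


clf = make_classifier("logistic_regression")
print(type(clf).__name__, hasattr(clf, "predict_proba"))
```

In this sketch only the logistic regression estimators expose `predict_proba`; `LinearSVC` does not, which is one reason class probabilities are mainly useful with logistic regression.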

Specify the merge method. It needs to be either scanorama or naive.

In [12]:
merge = 'scanorama' # We recommend to use scanorama here

Decide if you want to use the raw format or highly variable genes. Raw increases computational time and does not necessarily improve predictions.

In [13]:
use_raw = False # We recommend to use False here

You can choose to only consider a subset of genes from a signature set or use all genes.

In [14]:
genes_to_use = 'all' # We suggest to use all here, but the runtime is strongly improved if you select an appropriate gene set

Column names need to be standardised so the function knows which columns to compare.

In [15]:
#adata_test.obs["dblabel"] = adata_test.obs.dbl
#adata_test_orig.obs["dblabel"] = adata_test_orig.obs.celltype3_original
adata_train_list[1].obs["dblabel"] = adata_train_list[1].obs.celltype3
In [16]:
adata_test.obs.dblabel.unique()
Out[16]:
[naive thymus-derived CD8-positive, alpha-beta ..., naive B cell, central memory CD4-positive, alpha-beta T cell, classical monocyte, IL7R-max CD8-positive, alpha-beta cytotoxic T ..., non-classical monocyte, naive thymus-derived CD4-positive, alpha-beta ..., CD8-positive, alpha-beta cytotoxic T cell, cytotoxic CD56-dim natural killer cell, CD1c-positive myeloid dendritic cell]
Categories (10, object): [naive thymus-derived CD8-positive, alpha-beta ..., naive B cell, central memory CD4-positive, alpha-beta T cell, classical monocyte, ..., naive thymus-derived CD4-positive, alpha-beta ..., CD8-positive, alpha-beta cytotoxic T cell, cytotoxic CD56-dim natural killer cell, CD1c-positive myeloid dendritic cell]
In [17]:
adata_train_list[0].obs.dblabel.unique()
Out[17]:
[naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, naive B cell, lymphocyte of B lineage, naive thymus-derived CD8-positive, alpha-beta ..., ..., IL7R-max CD8-positive, alpha-beta cytotoxic T ..., hematopoietic multipotent progenitor cell, myeloid leukocyte, basophil, plasma cell]
Length: 25
Categories (25, object): [naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, naive B cell, lymphocyte of B lineage, ..., hematopoietic multipotent progenitor cell, myeloid leukocyte, basophil, plasma cell]
In [18]:
adata_train_list[1].obs.dblabel.unique()
Out[18]:
[cytotoxic CD56-dim natural killer cell, naive thymus-derived CD8-positive, alpha-beta ..., naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, CD8-positive, alpha-beta cytotoxic T cell, ..., regulatory T cell, CD1c-positive myeloid dendritic cell, plasmacytoid dendritic cell, erythrocyte, plasma cell]
Length: 14
Categories (14, object): [cytotoxic CD56-dim natural killer cell, naive thymus-derived CD8-positive, alpha-beta ..., naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, ..., CD1c-positive myeloid dendritic cell, plasmacytoid dendritic cell, erythrocyte, plasma cell]
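The label sets above differ between datasets, so it can help to list which labels occur only in the test data or only in the training data. The following is a small sketch on toy `.obs`-like tables; the data frames and their contents are made up for illustration, and only the column names `dblabel` and `celltype3` come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the .obs tables of a training and a test dataset.
train_obs = pd.DataFrame({"celltype3": ["naive B cell", "plasma cell", "classical monocyte"]})
test_obs = pd.DataFrame({"dblabel": ["naive B cell", "non-classical monocyte"]})

# Copy the annotation into the common column name used for training.
label_column = "dblabel"
train_obs[label_column] = train_obs["celltype3"]

train_labels = set(train_obs[label_column].unique())
test_labels = set(test_obs[label_column].unique())

# Labels only in the test data can never be predicted by the classifier.
only_in_test = sorted(test_labels - train_labels)
# Labels only in the training data may be predicted but have no true counterpart.
only_in_train = sorted(train_labels - test_labels)

print("only in test:", only_in_test)
print("only in train:", only_in_train)
```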
In [19]:
adata_test.var.dtypes
Out[19]:
ENSEMBL           object
SYMBOL          category
n_cells            int64
total_counts     float32
frac_reads       float32
mean             float64
std              float64
dtype: object
In [20]:
adata_train_list[0].var.dtypes
Out[20]:
ENSEMBL           object
SYMBOL            object
feature_type    category
n_cells            int64
total_counts     float32
frac_reads       float32
dtype: object
In [21]:
adata_train_list[1].var.dtypes
Out[21]:
ENSEMBL         category
SYMBOL            object
feature_type    category
n_cells          float64
total_counts     float32
frac_reads       float32
dtype: object

Correct datasets (e.g. using scanorama) and merge training datasets

This function merges the training datasets, removes unwanted genes, and, if scanorama is used, corrects for differences between the datasets.

In [22]:
adata_train, adata_test_corrected = bc.tl.auto_annot.merge_data(adata_train_list, adata_test, genes_to_use = genes_to_use, merge = merge)
merging with scanorama
using scanorama rn
Found 207 genes among all datasets
[[0.         0.69287335 0.47963259]
 [0.         0.         0.9908147 ]
 [0.         0.         0.        ]]
Processing datasets (1, 2)
Processing datasets (0, 1)
Processing datasets (0, 2)
integrating training set
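The gene-restriction part of merging can be sketched with plain pandas: keep only genes found in every dataset, then stack the training cells. This corresponds roughly to a naive merge without any correction. The toy data frames below are assumptions for illustration, and the code does not reproduce what `merge_data` or scanorama do internally.

```python
import pandas as pd

# Toy expression tables (cells x genes) with partly overlapping gene sets.
train1 = pd.DataFrame([[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]],
                      index=["a1", "a2"], columns=["CD4", "CD8A", "IL7R"])
train2 = pd.DataFrame([[2.0, 1.0], [0.0, 4.0]],
                      index=["b1", "b2"], columns=["IL7R", "CD4"])
test = pd.DataFrame([[1.0, 1.0, 0.0]],
                    index=["t1"], columns=["CD4", "IL7R", "MS4A1"])

# Genes present in all datasets, in a fixed order.
shared = sorted(set(train1.columns) & set(train2.columns) & set(test.columns))
print(f"Found {len(shared)} genes among all datasets")

# Naive merge: restrict to shared genes and stack the training cells.
merged_train = pd.concat([train1[shared], train2[shared]], axis=0)
test_shared = test[shared]
```

Selecting columns by name also aligns the gene order across datasets, which matters because the classifier treats each column position as one feature.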

Train the classifier

The returned scaler is fitted on the training dataset (to zero mean and unit variance). The same scaling is then applied to the counts in the testing dataset, and the classifier is applied to the scaled testing dataset (see next step, adata_predict()). This function will run multiple jobs in parallel if logistic regression was specified as method.

In [23]:
classifier, scaler = bc.tl.auto_annot.fit(adata_train, method, celltype, njobs=10)
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   5 out of   5 | elapsed:  3.5min finished
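The scaling described above can be written out with NumPy: the mean and standard deviation are estimated on the training data only and then reused for the test data. This is a sketch with made-up numbers, not the scaler object returned by `fit`.

```python
import numpy as np

# Toy matrices (cells x genes).
X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[3.0, 10.0], [7.0, 70.0]])

# Per-gene statistics come from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0] = 1.0  # avoid division by zero for constant genes

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # same transformation, no refitting

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The scaled test data will generally not have zero mean or unit variance, which is expected: the test cells are placed in the coordinate system the classifier was trained in.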

Prediction

If in addition to the most likely class you would like to have all class probabilities returned use the following function. (This is only a sensible choice if using logistic regression.)

In [24]:
adata_predicted = bc.tl.auto_annot.adata_pred_prob(classifier = classifier, scaler = scaler, adata_pred = adata_test_corrected, adata_orig = adata_test_orig, threshold = 0.0)
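How a probability threshold can turn class probabilities into labels is sketched below. It assumes that a cell whose highest class probability falls below the threshold is labelled 'unknown' (the label filtered out later in this notebook); whether besca compares with `>=` or `>` is not stated here, so that detail is an assumption.

```python
import numpy as np

classes = np.array(["naive B cell", "classical monocyte", "regulatory T cell"])

# Toy class probabilities, one row per cell, one column per class.
probs = np.array([[0.95, 0.03, 0.02],
                  [0.40, 0.35, 0.25],
                  [0.10, 0.75, 0.15]])

threshold = 0.7

# Most likely class per cell, kept only if it is confident enough.
best = probs.argmax(axis=1)
labels = np.where(probs.max(axis=1) >= threshold, classes[best], "unknown")
print(labels)

# With threshold = 0.0 every cell keeps its most likely class.
labels_no_threshold = np.where(probs.max(axis=1) >= 0.0, classes[best], "unknown")
```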

Output

The adata object that includes the predicted cell type annotation can be written out as an h5ad file.

In [25]:
adata_predicted.write('./adata_predicted_18122020.h5ad')
... storing 'auto_annot' as categorical

If the testing dataset already includes a cell type annotation, a report can be generated and written, which includes metrics, confusion matrices and comparative umap plots.

In [26]:
adata_predicted.obs
Out[26]:
CELL percent_mito experiment n_counts n_genes leiden score_lymphocyte_scanpy score_myeloid_scanpy score_Bcell_scanpy score_Tcells_scanpy ... myeloid leukocyte naive B cell naive thymus-derived CD4-positive, alpha-beta T cell naive thymus-derived CD8-positive, alpha-beta T cell neutrophil non-classical monocyte plasma cell plasmacytoid dendritic cell pro-B cell regulatory T cell
AAACATACAACCAC-1 AAACATACAACCAC-1 0.030153 pbmc3k_filtered 2421.0 781 5 0.365830 -1.269378 -0.312056 0.742338 ... 3.633735e-04 0.000030 0.028511 1.913184e-02 1.216331e-03 7.421688e-05 2.386059e-04 2.388976e-04 5.953194e-06 0.013633
AAACATTGAGCTAC-1 AAACATTGAGCTAC-1 0.037936 pbmc3k_filtered 4903.0 1352 1 0.404063 -0.850506 1.221926 -1.033879 ... 9.971868e-06 0.006936 0.000051 4.354055e-06 4.114081e-06 5.516813e-08 3.199274e-04 1.916650e-05 2.964063e-03 0.000062
AAACATTGATCAGC-1 AAACATTGATCAGC-1 0.008892 pbmc3k_filtered 3148.0 1131 6 0.722883 -0.795757 -0.311508 0.678192 ... 4.579188e-06 0.000005 0.089841 2.361565e-02 1.121691e-04 1.168688e-05 6.069592e-05 2.435257e-05 1.226935e-06 0.027859
AAACCGTGCTTCCG-1 AAACCGTGCTTCCG-1 0.017431 pbmc3k_filtered 2639.0 960 0 -0.028641 2.001340 -0.336593 -1.024017 ... 8.070248e-05 0.000054 0.000632 8.118871e-06 3.178352e-04 5.851025e-01 7.349903e-07 2.190592e-07 3.533919e-08 0.000572
AAACGCACTGGTAC-1 AAACGCACTGGTAC-1 0.016636 pbmc3k_filtered 2163.0 782 10 -1.325377 -1.325377 -0.213229 0.981284 ... 2.516755e-03 0.000066 0.381117 6.707871e-02 9.671702e-03 1.636688e-04 1.317690e-04 4.442684e-04 1.365402e-06 0.498350
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TTTCGAACTCTCAT-1 TTTCGAACTCTCAT-1 0.021092 pbmc3k_filtered 3461.0 1155 0 -1.394863 3.353787 -0.247567 -1.183916 ... 1.101507e-01 0.002143 0.001602 1.322324e-05 1.048036e-02 7.749108e-04 3.019427e-06 1.359353e-04 2.565580e-06 0.003373
TTTCTACTGAGGCA-1 TTTCTACTGAGGCA-1 0.009283 pbmc3k_filtered 3447.0 1227 1 -1.509175 -1.055424 1.252956 -1.227544 ... 5.948513e-08 0.000004 0.000005 8.316186e-07 6.201927e-07 1.604168e-08 9.995227e-01 9.721629e-07 1.548916e-04 0.000004
TTTCTACTTCCTCG-1 TTTCTACTTCCTCG-1 0.021971 pbmc3k_filtered 1684.0 622 1 0.692867 -0.598499 1.580936 -0.812423 ... 1.904832e-04 0.982387 0.000030 1.518159e-06 1.021629e-04 1.345294e-05 5.725952e-06 7.979426e-06 1.220316e-03 0.000077
TTTGCATGAGAGGC-1 TTTGCATGAGAGGC-1 0.020508 pbmc3k_filtered 1022.0 454 1 -0.857408 -0.857408 3.220181 -0.845517 ... 8.278645e-05 0.947537 0.000041 1.047308e-06 5.933388e-05 1.980516e-05 7.316373e-06 4.732187e-06 2.138288e-03 0.000042
TTTGCATGCCTCAC-1 TTTGCATGCCTCAC-1 0.008060 pbmc3k_filtered 1985.0 724 11 -1.300216 -0.101524 -0.156380 1.661736 ... 8.331261e-04 0.000175 0.470620 1.118394e-01 1.210934e-02 1.091443e-04 6.241950e-05 6.727024e-05 2.147404e-06 0.360543

2504 rows × 153 columns

In [27]:
adata_predicted = bc.st.clustering(adata_predicted, '.')
leiden clustering performed with a resolution of 1
WARNING: saving figure to file figures/umap.leiden.png
rank genes per cluster calculated using method wilcoxon.
mapping of cells to  leiden exported successfully to cell2labels.tsv
average.gct exported successfully to file
fract_pos.gct exported successfully to file
labelinfo.tsv successfully written out
./labelings/leiden/WilxRank.gct written out
./labelings/leiden/WilxRank.pvalues.gct written out
./labelings/leiden/WilxRank.logFC.gct written out
In [29]:
%matplotlib inline
sc.settings.set_figure_params(dpi=90)
bc.tl.report(adata_pred=adata_predicted, celltype=celltype, method=method, analysis_name=analysis_name,
                        train_datasets=adata_train_list, test_dataset=adata_test_orig, merge=merge, use_raw=False,
                        genes_to_use=genes_to_use, remove_nonshared=True, clustering='leiden', asymmetric_matrix=True)
WARNING: saving figure to file figures/umap.ondata_auto_annot_pubimage.png
WARNING: saving figure to file figures/umap.auto_annot_pubimage.png
Confusion matrix, without normalization
Normalized confusion matrix
In [30]:
sc.settings.set_figure_params(dpi=240)

sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot', 'leiden'], legend_loc='on data',legend_fontsize=7,  save= '.fig4_ondata.svg')
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot', 'leiden'],legend_fontsize=7, wspace = 1.4, save = '.fig4.svg')
WARNING: saving figure to file figures/umap.fig4_ondata.svg
WARNING: saving figure to file figures/umap.fig4.svg
In [31]:
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(y_true, y_pred, classes, celltype,
                          normalize=False,
                          title=None, numbers =False,
                          cmap=plt.cm.Blues, adata_predicted= None, asymmetric_matrix = True): 

    matplotlib.use('Agg')
    
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    if asymmetric_matrix == True:
        class_names =  np.unique(np.concatenate((adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'])))
        class_names_orig = np.unique(adata_predicted.obs[celltype])
        class_names_pred = np.unique(adata_predicted.obs['auto_annot'])
        test_celltypes_ind = np.searchsorted(class_names, class_names_orig)
        train_celltypes_ind = np.searchsorted(class_names, class_names_pred)
        cm=cm[test_celltypes_ind,:][:,train_celltypes_ind]
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    fig, ax = plt.subplots(figsize=(15,15))
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax, shrink = 0.8)
    # We want to show all ticks...
    if asymmetric_matrix == True:
        ax.set(xticks=np.arange(cm.shape[1]),
               yticks=np.arange(cm.shape[0]),
               # ... and label them with the respective list entries
               xticklabels=class_names_pred, yticklabels=class_names_orig,
               title=title,
               ylabel='True label',
               xlabel='Predicted label')
    else:
        ax.set(xticks=np.arange(cm.shape[1]),
               yticks=np.arange(cm.shape[0]),
               # ... and label them with the respective list entries
               xticklabels=classes, yticklabels=classes,
               title=title,
               ylabel='True label',
               xlabel='Predicted label')
        
    ax.grid(False)
    #ax.tick_params(axis='both', which='major', labelsize=10)
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    if numbers == True:
        fmt = '.2f' if normalize else 'd'
        thresh = cm.max() / 2.
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                ax.text(j, i, format(cm[i, j], fmt),
                        ha="center", va="center",
                        color="white" if cm[i, j] > thresh else "black")
    #fig.tight_layout()
    return ax
In [32]:
# make conf matrices (4)
class_names =  np.unique(np.concatenate((adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'])))
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plot_confusion_matrix(adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'], title = " ", classes=class_names, celltype=celltype ,numbers = False, adata_predicted = adata_predicted, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_confusion_matrix_nonnormalised.svg'))

# Plot normalized confusion matrix (numbers = False, so cell values are not printed)
plot_confusion_matrix(adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'], title = " ", classes=class_names,celltype=celltype,  normalize=True, numbers = False, adata_predicted = adata_predicted, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_confusion_matrix_normalised.svg'))
Confusion matrix, without normalization
Normalized confusion matrix
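The normalization used in plot_confusion_matrix divides each row by its sum, so every row shows the fraction of cells of one true label assigned to each predicted label. A small sketch with a made-up count matrix follows; the guard against empty rows is an addition for illustration and is not part of the function above.

```python
import numpy as np

# Toy confusion matrix: rows are true labels, columns are predicted labels.
cm = np.array([[8, 2, 0],
               [1, 9, 0],
               [0, 5, 5]])

# Row sums as a column vector so that broadcasting divides each row by its total.
row_sums = cm.sum(axis=1, keepdims=True)

# Divide row-wise; rows without any cells would stay at zero instead of NaN.
cm_norm = np.divide(cm, row_sums,
                    out=np.zeros(cm.shape, dtype=float),
                    where=row_sums != 0)

np.set_printoptions(precision=2)
print(cm_norm)
```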

Let's use a threshold on the predicted class probabilities.

In [33]:
analysis_name = 'auto_annot_pubimage_threshold' # The analysis name will be used to name the output files
In [34]:
adata_predicted_threshold = bc.tl.auto_annot.adata_pred_prob(classifier = classifier, scaler = scaler, adata_pred = adata_test_corrected, adata_orig = adata_test_orig, threshold = 0.7)
In [35]:
adata_predicted_threshold.write('./adata_predicted_threshold_18122020.h5ad')
... storing 'auto_annot' as categorical
In [36]:
%matplotlib inline
sc.settings.set_figure_params(dpi=90)
bc.tl.report(adata_pred=adata_predicted_threshold, celltype=celltype, method=method, analysis_name=analysis_name,
                        train_datasets=adata_train_list, test_dataset=adata_test_orig, merge=merge, use_raw=False,
                        genes_to_use=genes_to_use, remove_nonshared=True, clustering='leiden', asymmetric_matrix=True)
WARNING: saving figure to file figures/umap.ondata_auto_annot_pubimage_threshold.png
WARNING: saving figure to file figures/umap.auto_annot_pubimage_threshold.png
Confusion matrix, without normalization
Normalized confusion matrix
In [37]:
sc.settings.set_figure_params(dpi=240)

sc.pl.umap(adata_predicted_threshold, color=[celltype, 'auto_annot', 'leiden'], legend_loc='on data',legend_fontsize=7,  save= '.fig4_threshold_ondata.svg')
sc.pl.umap(adata_predicted_threshold, color=[celltype, 'auto_annot', 'leiden'],legend_fontsize=7, wspace = 1.4, save = '.fig4_threshold.svg')
WARNING: saving figure to file figures/umap.fig4_threshold_ondata.svg
WARNING: saving figure to file figures/umap.fig4_threshold.svg
In [38]:
# make conf matrices (4)
class_names =  np.unique(np.concatenate((adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'])))
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plot_confusion_matrix(adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'], title = " ", classes=class_names, celltype=celltype ,numbers = False, adata_predicted = adata_predicted_threshold, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_confusion_matrix_threshold_nonnormalised.svg'))

# Plot normalized confusion matrix (numbers = False, so cell values are not printed)
plot_confusion_matrix(adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'], title = " ", classes=class_names,celltype=celltype,  normalize=True, numbers = False, adata_predicted = adata_predicted_threshold, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_confusion_matrix_threshold_normalised.svg'))
Confusion matrix, without normalization
Normalized confusion matrix
In [39]:
adata_predicted_wo_unknown = adata_predicted_threshold.copy()
adata_predicted_wo_unknown = bc.subset_adata(adata_predicted_wo_unknown, adata_predicted_wo_unknown.obs.auto_annot != 'unknown', raw=False)
bc.pl.riverplot_2categories(adata_predicted_wo_unknown, [celltype, 'auto_annot'])
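The information a river plot of two categories conveys can also be inspected as a table: after dropping cells predicted as 'unknown', count how many cells flow from each original label to each predicted label. The sketch uses a toy `.obs`-like data frame; the rows are made up, and only the column names and the 'unknown' label come from this notebook.

```python
import pandas as pd

# Toy stand-in for adata.obs with original and predicted annotation.
obs = pd.DataFrame({
    "dblabel": ["naive B cell", "naive B cell", "classical monocyte",
                "regulatory T cell", "classical monocyte"],
    "auto_annot": ["naive B cell", "naive B cell", "classical monocyte",
                   "unknown", "non-classical monocyte"],
})

# Drop cells that were not confidently annotated.
known = obs[obs["auto_annot"] != "unknown"]

# Rows: original labels, columns: predicted labels, values: number of cells.
flows = pd.crosstab(known["dblabel"], known["auto_annot"])
print(flows)
```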

Let's check whether the differences in annotation make sense.

In [40]:
 ## PROVIDED WITH BESCA
gmt_file_IMM=pkg_resources.resource_filename('besca', 'datasets/genesets/HumanCD45p_scseqCMs6.gmt')
bc.tl.sig.combined_signature_score(adata_predicted, gmt_file_IMM)
WARNING: genes are not in var_names and ignored: ['FCRL4']
WARNING: genes are not in var_names and ignored: ['FCGR3']
WARNING: genes are not in var_names and ignored: ['ENPP3']
WARNING: genes are not in var_names and ignored: ['BRIP1', 'CASP8AP2', 'DSSC1', 'E2F8']
WARNING: genes are not in var_names and ignored: ['ANLN', 'CDCA2', 'CSK2']
WARNING: genes are not in var_names and ignored: ['FAP', 'THY1', 'DCN', 'COL1A1', 'COL1A2', 'COL6A3', 'CXCL14', 'LUM', 'COL3A1', 'DPT', 'ISLR', 'FDF7', 'PDGFRL']
WARNING: genes are not in var_names and ignored: ['TNFA', 'IL2', 'IL7A', 'IL12', 'IL13', 'IL21', 'IL22', 'IL23', 'CXCL1', 'CXCL5', 'CXCL11', 'CXCL12', 'CXCL13', 'CX3CL1', 'GM-CSF', 'GCSFCCL1', 'CCL7', 'CCL11', 'CCL12', 'CCL13', 'CCL17', 'CCL19', 'CCL22', 'CCL25', 'CCL24', 'CCL26', 'SDF1A', 'BCA1', 'MIP1B']
WARNING: genes are not in var_names and ignored: ['CLEC9A', 'LY6C1', 'SIGLECH']
WARNING: genes are not in var_names and ignored: ['CD34', 'CDH5', 'FLT4', 'ITCAM1', 'KDR', 'PECAM1', 'SELE', 'TEK', 'VCAM1', 'VWF']
WARNING: genes are not in var_names and ignored: ['CD34', 'CDH5', 'FLT4', 'KDR', 'PECAM1', 'SELE', 'TEK', 'VCAM1', 'VWF']
WARNING: genes are not in var_names and ignored: ['PECAM1', 'VWF', 'CDH5', 'ECSCR', 'CCL14', 'SLCO2A1', 'KDR', 'TIE1', 'ERG', 'FABP4']
WARNING: genes are not in var_names and ignored: ['IL9R', 'SLIGLEC10', 'SIGLEC8']
WARNING: genes are not in var_names and ignored: ['EPCAM', 'KRT19']
WARNING: genes are not in var_names and ignored: ['EGR3', 'TILPL2']
WARNING: genes are not in var_names and ignored: ['HLA-H', 'HLA-L', 'HLA-DRB2']
WARNING: genes are not in var_names and ignored: ['OAS1G']
WARNING: genes are not in var_names and ignored: ['ADGRE1', 'APOE', 'MSR1']
WARNING: genes are not in var_names and ignored: ['ENPP3']
WARNING: genes are not in var_names and ignored: ['PECAM1']
WARNING: genes are not in var_names and ignored: ['MIA', 'TYR', 'SLC45A2', 'CDH19', 'PMEL', 'SLC24A5', 'MAGEA6', 'GJB1', 'PLP1', 'PRAME', 'PAX3', 'S100A1', 'MLANA']
WARNING: genes are not in var_names and ignored: ['RGS5', 'SLIT2', 'BGN', 'TNC', 'CYR6', 'GFRA3', 'SLITRK6', 'AQP1']
WARNING: genes are not in var_names and ignored: ['IGHG1', 'IGHG2', 'IGHA1']
WARNING: genes are not in var_names and ignored: ['FCGR3']
WARNING: genes are not in var_names and ignored: ['FCGR3']
WARNING: genes are not in var_names and ignored: ['FCGR3', 'FCGR1', 'MSR1']
WARNING: genes are not in var_names and ignored: ['FCGR4', 'CD34', 'FCGR1']
WARNING: genes are not in var_names and ignored: ['LY6G', 'CD177']
WARNING: genes are not in var_names and ignored: ['TRDC']
WARNING: genes are not in var_names and ignored: ['IGHD', 'IGHM']
WARNING: genes are not in var_names and ignored: ['CEACAM8', 'MME']
WARNING: genes are not in var_names and ignored: ['CD34']
WARNING: genes are not in var_names and ignored: ['IGF1', 'ITGA8']
WARNING: genes are not in var_names and ignored: ['CDH2', 'JAG1', 'SMO', 'SOX9']
WARNING: genes are not in var_names and ignored: ['SOX9']
WARNING: genes are not in var_names and ignored: ['MME', 'MMP1', 'MMP2', 'PDGFRA', 'PECAM1', 'THY1', 'VCAM1']
WARNING: genes are not in var_names and ignored: ['TRADO']
WARNING: genes are not in var_names and ignored: ['APOE', 'CXCL12', 'CD209', 'MSR1']
WARNING: genes are not in var_names and ignored: ['CXCL11']
WARNING: genes are not in var_names and ignored: ['ANGTPL4', 'CXCL5']
WARNING: genes are not in var_names and ignored: ['CSF2', 'SPP4', 'IFNA1', 'IL2', 'TNFSF11']
WARNING: genes are not in var_names and ignored: ['IL17A', 'IL21', 'IL22']
WARNING: genes are not in var_names and ignored: ['CCR8', 'CSF2', 'CSCR4', 'IL13', 'IL5']
WARNING: genes are not in var_names and ignored: ['TRAC']
WARNING: genes are not in var_names and ignored: ['TRGC1', 'TRDC', 'TRDV2', 'TRDV1']
WARNING: provided gene list has length 0, scores as 0
WARNING: genes are not in var_names and ignored: ['TRGC2']
WARNING: genes are not in var_names and ignored: ['FLAMF1']
WARNING: genes are not in var_names and ignored: ['IL2']
WARNING: genes are not in var_names and ignored: ['EGR3']
WARNING: genes are not in var_names and ignored: ['CCXR3']
WARNING: genes are not in var_names and ignored: ['CCL22', 'CCL17', 'CCL19']
WARNING: genes are not in var_names and ignored: ['CLEC9A', 'XCR1', 'CLNK']
WARNING: genes are not in var_names and ignored: ['CLEC9A', 'PLET1', 'XCR1']
WARNING: genes are not in var_names and ignored: ['EPCAM', 'SIGLECG', 'PLET1', 'PPP1R1A']
WARNING: genes are not in var_names and ignored: ['ARG1']
WARNING: genes are not in var_names and ignored: ['SIGLECH']
WARNING: genes are not in var_names and ignored: ['C7', 'SIGLECG']
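The warnings above arise because some signature genes are not among the genes of the dataset. The sketch below shows how one line of a GMT file (set name, description, then genes, all tab-separated) can be split into genes that are present and genes that are missing. The helper `parse_gmt_line`, the signature line and the gene set `var_names` are made up for illustration and do not reproduce the scoring done by besca or scanpy.

```python
def parse_gmt_line(line):
    """Hypothetical helper: split one GMT line into name, description and genes."""
    fields = line.rstrip("\n").split("\t")
    name, description = fields[0], fields[1]
    genes = [g for g in fields[2:] if g]  # skip empty trailing fields
    return name, description, genes


# Toy stand-in for adata.var_names.
var_names = {"CD79A", "MS4A1", "CD3E", "IL7R"}

name, _, genes = parse_gmt_line("Bcells\ttoy B cell signature\tCD79A\tMS4A1\tFCRL4\n")

# Only genes present in the dataset can contribute to a score.
present = [g for g in genes if g in var_names]
missing = [g for g in genes if g not in var_names]

print(f"{name}: scoring on {present}, ignoring {missing}")
```

Checking the missing genes before scoring makes it easier to spot signatures that are left with very few genes, or none at all.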
In [41]:
adata_predicted.var_names
Out[41]:
Index(['ISG15', 'TNFRSF4', 'CPSF3L', 'MRPL20', 'ATAD3C', 'C1orf86', 'RER1',
       'TPRG1L', 'TNFRSF25', 'TNFRSF9',
       ...
       'DSCR3', 'BRWD1', 'BACE2', 'SIK1', 'C21orf33', 'ICOSLG', 'SUMO3',
       'SLC19A1', 'S100B', 'PRMT2'],
      dtype='object', length=1719)
In [42]:
scores = [x for x in adata_predicted.obs.columns if 'CD45' in x]
scores
Out[42]:
['score_HumanCD45p_scseqCMs6_ActB_scanpy',
 'score_HumanCD45p_scseqCMs6_Activation_scanpy',
 'score_HumanCD45p_scseqCMs6_Basophil_scanpy',
 'score_HumanCD45p_scseqCMs6_Bcells_scanpy',
 'score_HumanCD45p_scseqCMs6_CCG1S_scanpy',
 'score_HumanCD45p_scseqCMs6_CCG2M_scanpy',
 'score_HumanCD45p_scseqCMs6_Cafs_scanpy',
 'score_HumanCD45p_scseqCMs6_Cellcycle_scanpy',
 'score_HumanCD45p_scseqCMs6_Checkpoint_scanpy',
 'score_HumanCD45p_scseqCMs6_Cyto_scanpy',
 'score_HumanCD45p_scseqCMs6_Cytotox_scanpy',
 'score_HumanCD45p_scseqCMs6_DCR_scanpy',
 'score_HumanCD45p_scseqCMs6_DCrec_scanpy',
 'score_HumanCD45p_scseqCMs6_DCs_scanpy',
 'score_HumanCD45p_scseqCMs6_Eff_scanpy',
 'score_HumanCD45p_scseqCMs6_Endo_scanpy',
 'score_HumanCD45p_scseqCMs6_Endot_scanpy',
 'score_HumanCD45p_scseqCMs6_Endothelial_scanpy',
 'score_HumanCD45p_scseqCMs6_Eosinophil_scanpy',
 'score_HumanCD45p_scseqCMs6_Epith_scanpy',
 'score_HumanCD45p_scseqCMs6_ExhB_scanpy',
 'score_HumanCD45p_scseqCMs6_Granulo_scanpy',
 'score_HumanCD45p_scseqCMs6_HLA_scanpy',
 'score_HumanCD45p_scseqCMs6_HLAP_scanpy',
 'score_HumanCD45p_scseqCMs6_HLAS_scanpy',
 'score_HumanCD45p_scseqCMs6_Ifi_scanpy',
 'score_HumanCD45p_scseqCMs6_Ifng_scanpy',
 'score_HumanCD45p_scseqCMs6_Macrophage_scanpy',
 'score_HumanCD45p_scseqCMs6_Mast_scanpy',
 'score_HumanCD45p_scseqCMs6_Megakaryocytes_scanpy',
 'score_HumanCD45p_scseqCMs6_MelMelan_scanpy',
 'score_HumanCD45p_scseqCMs6_MelMesen_scanpy',
 'score_HumanCD45p_scseqCMs6_MemB_scanpy',
 'score_HumanCD45p_scseqCMs6_Memory_scanpy',
 'score_HumanCD45p_scseqCMs6_Mo14_scanpy',
 'score_HumanCD45p_scseqCMs6_Mo16_scanpy',
 'score_HumanCD45p_scseqCMs6_MoMa_scanpy',
 'score_HumanCD45p_scseqCMs6_Monocytes_scanpy',
 'score_HumanCD45p_scseqCMs6_Myelo_scanpy',
 'score_HumanCD45p_scseqCMs6_MyeloSubtype_scanpy',
 'score_HumanCD45p_scseqCMs6_NKT_scanpy',
 'score_HumanCD45p_scseqCMs6_NKcells_scanpy',
 'score_HumanCD45p_scseqCMs6_NKcyt_scanpy',
 'score_HumanCD45p_scseqCMs6_NKnai_scanpy',
 'score_HumanCD45p_scseqCMs6_Naive_scanpy',
 'score_HumanCD45p_scseqCMs6_NaiveB_scanpy',
 'score_HumanCD45p_scseqCMs6_Neutrophil_scanpy',
 'score_HumanCD45p_scseqCMs6_NonEff_scanpy',
 'score_HumanCD45p_scseqCMs6_OMyelo_scanpy',
 'score_HumanCD45p_scseqCMs6_Others_scanpy',
 'score_HumanCD45p_scseqCMs6_Plasma_scanpy',
 'score_HumanCD45p_scseqCMs6_Pyro_scanpy',
 'score_HumanCD45p_scseqCMs6_Stemmess_scanpy',
 'score_HumanCD45p_scseqCMs6_StemmessS_scanpy',
 'score_HumanCD45p_scseqCMs6_Stromal_scanpy',
 'score_HumanCD45p_scseqCMs6_T4CM_scanpy',
 'score_HumanCD45p_scseqCMs6_TAM_scanpy',
 'score_HumanCD45p_scseqCMs6_TAMCx_scanpy',
 'score_HumanCD45p_scseqCMs6_TEM_scanpy',
 'score_HumanCD45p_scseqCMs6_TMO_scanpy',
 'score_HumanCD45p_scseqCMs6_TMid_scanpy',
 'score_HumanCD45p_scseqCMs6_TNK_scanpy',
 'score_HumanCD45p_scseqCMs6_TStem_scanpy',
 'score_HumanCD45p_scseqCMs6_TStemhi_scanpy',
 'score_HumanCD45p_scseqCMs6_TSteml_scanpy',
 'score_HumanCD45p_scseqCMs6_TStemlo_scanpy',
 'score_HumanCD45p_scseqCMs6_TTh1_scanpy',
 'score_HumanCD45p_scseqCMs6_TTh17_scanpy',
 'score_HumanCD45p_scseqCMs6_TTh2_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcd4_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcd8_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcells_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcgd_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcytox_scanpy',
 'score_HumanCD45p_scseqCMs6_Teff_scanpy',
 'score_HumanCD45p_scseqCMs6_Tfh_scanpy',
 'score_HumanCD45p_scseqCMs6_TilCM_scanpy',
 'score_HumanCD45p_scseqCMs6_Tpexh_scanpy',
 'score_HumanCD45p_scseqCMs6_Treg_scanpy',
 'score_HumanCD45p_scseqCMs6_Ttexh_scanpy',
 'score_HumanCD45p_scseqCMs6_Ubi_scanpy',
 'score_HumanCD45p_scseqCMs6_UnivExh_scanpy',
 'score_HumanCD45p_scseqCMs6_UnivMem_scanpy',
 'score_HumanCD45p_scseqCMs6_UnivNaive_scanpy',
 'score_HumanCD45p_scseqCMs6_aDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_allSteml_scanpy',
 'score_HumanCD45p_scseqCMs6_cDC1_scanpy',
 'score_HumanCD45p_scseqCMs6_cDC2_scanpy',
 'score_HumanCD45p_scseqCMs6_cDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_epDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_general_scanpy',
 'score_HumanCD45p_scseqCMs6_moDC_scanpy',
 'score_HumanCD45p_scseqCMs6_pDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_uDCs_scanpy']

Indeed, the classification of B cells appears to be an improvement, whereas the T cell subtypes pose difficulties.

In [43]:
sc.pl.umap(adata_predicted, color= ["score_HumanCD45p_scseqCMs6_MemB_scanpy", "score_HumanCD45p_scseqCMs6_NaiveB_scanpy","CD4", "CD8A"], ncols = 2, wspace = 0.4, color_map = 'viridis',save= '.fig4_markers.svg')
WARNING: saving figure to file figures/umap.fig4_markers.svg
In [44]:
sc.pl.umap(adata_predicted, color= ["IL7R"], ncols = 2, wspace = 0.4, color_map = 'viridis')